Text similarity detection plays a critical role in web scraping data cleaning, especially when collecting large-scale news and content data from multiple sources. Different websites often publish the same story or rewrite the same topic using slightly different wording.
To maintain data quality, scraping systems must determine whether two articles are similar. When similarity is high, the system can group articles under one topic or discard near-duplicate content. This article introduces two widely used approaches for text similarity detection in web scraping data cleaning, explains their principles, and analyzes their limitations.
Levenshtein Distance
Levenshtein distance (edit distance) measures how different two strings are. The algorithm counts the minimum number of single-character edits required to transform one string into the other.
The algorithm supports three operations:
- Insertion: Add a character (for example, change “abc” to “abdc” by inserting “d”)
- Deletion: Remove a character (for example, change “abc” to “ac” by deleting “b”)
- Replacement: Replace a character (for example, change “abc” to “adc” by replacing “b” with “d”)
Levenshtein Distance Example
Consider the strings “kitten” and “sitting”.
- “kitten” → “sitten” (replace “k” with “s”)
- “sitten” → “sittin” (replace “e” with “i”)
- “sittin” → “sitting” (insert “g”)
The process needs 3 operations, so the Levenshtein distance equals 3.
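To make the counting concrete, here is a minimal dynamic-programming sketch of the algorithm in pure Python. It is illustrative only; the python-Levenshtein library introduced below is much faster in practice.

def levenshtein(s, t):
    # dp[i][j] = edit distance between s[:i] and t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete every character of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert every character of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # replacement (or match)
            )
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3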
Python Library: python-Levenshtein
A popular Python implementation is the python-Levenshtein package.
Install it with:
pip install python-Levenshtein
Now compare two sentences and compute a normalized similarity score.
Test Inputs
text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The quick brown fox jumps over the sleepy dog"
Example Code
from Levenshtein import distance  # install python-Levenshtein first

def text_similarity_simple(text1, text2):
    # Normalize the edit distance by the length of the longer string
    edit_dist = distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - edit_dist / max_len if max_len > 0 else 1.0

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The quick brown fox jumps over the sleepy dog"

print(f"Similarity between A and B: {text_similarity_simple(text_a, text_b):.2f}")  # ~0.91
The code outputs a similarity score of about 0.91, so the two sentences look very similar at the character level.
Why Levenshtein Distance Fails on Semantics
Levenshtein distance works well for spelling correction and simple string matching. However, the algorithm operates on surface characters and ignores meaning. That design causes problems with long texts, semantically complex content, and domain-specific writing.
1) The algorithm ignores semantic meaning
Levenshtein distance counts edit operations. It does not evaluate what words mean.
This behavior creates two common errors:
- The algorithm may treat two unrelated words as similar when their characters overlap.
- The algorithm may treat two equivalent meanings as dissimilar when their surface forms differ.
Example:
- Word A: “apple” (fruit)
- Word B: “apply” (verb)
The two words differ by one character, so the algorithm reports high similarity. In practice, the meanings have no relationship.
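A quick check with the text_similarity_simple helper defined earlier makes both failure modes visible (the word pairs are illustrative examples, not from a real dataset):

print(text_similarity_simple("apple", "apply"))  # 0.8 -> high score, unrelated meanings
print(text_similarity_simple("big", "large"))    # 0.2 -> low score, equivalent meanings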
TF-IDF + Cosine Similarity (Word Frequency Matching)
Many data cleaning pipelines also use TF-IDF + cosine similarity to compare sentences or articles.
This approach uses two steps:
- TF-IDF assigns weights to words based on importance.
- Cosine similarity compares two TF-IDF vectors and returns a score between 0 and 1.
How TF-IDF Works
TF-IDF calculates a weight for each word by multiplying:
- TF (Term Frequency): how often a word appears in a document
- IDF (Inverse Document Frequency): how rare the word is across the full corpus
Step 1: TF (Term Frequency)
TF measures frequency inside one document.
A simple formula looks like this:
TF(t, d) = count(t in d) / total_words(d)
Example:
Document: d = "The cat chases the mouse"
Tokens: ["The", "cat", "chases", "The", "mouse"]
- Total tokens: 5
- TF(The, d) = 2/5 = 0.4
- TF(cat, d) = 1/5 = 0.2
- TF(chases, d) = 1/5 = 0.2
- TF(mouse, d) = 1/5 = 0.2
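The same TF calculation takes only a few lines of Python (a minimal sketch using lowercasing and whitespace tokenization, not a production tokenizer):

from collections import Counter

tokens = "The cat chases the mouse".lower().split()
tf = {word: count / len(tokens) for word, count in Counter(tokens).items()}
print(tf)  # {'the': 0.4, 'cat': 0.2, 'chases': 0.2, 'mouse': 0.2}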
Step 2: IDF (Inverse Document Frequency)
TF alone overvalues common words such as “the” or “is”. IDF reduces their impact and highlights rare words.
A common IDF formula looks like this:
IDF(t) = log( TotalDocs / (DocsContaining(t) + 1) )
Example with 1000 documents (using a base-10 logarithm):
- “The” appears in 990 docs → IDF ≈ log(1000/991) ≈ 0.004
- “blockchain” appears in 5 docs → IDF ≈ log(1000/6) ≈ 2.22
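The same numbers can be reproduced with the formula above. This sketch assumes math.log10 to match the base-10 figures, and reuses the illustrative document counts from the example:

import math

def idf(total_docs, docs_containing):
    return math.log10(total_docs / (docs_containing + 1))

print(round(idf(1000, 990), 3))  # ~0.004 for "The"
print(round(idf(1000, 5), 2))    # ~2.22 for "blockchain"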
Final TF-IDF Weight
TF-IDF(t, d) = TF(t, d) × IDF(t)
How Cosine Similarity Works
Cosine similarity compares two vectors.
For two vectors A and B, the formula is:
cos(θ) = (A · B) / (||A|| × ||B||)
where A · B is the dot product, and ||A|| and ||B|| are the vector magnitudes.
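A minimal sketch of the formula itself, without any library (the two short vectors are made-up examples):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))     # A · B
    norm_a = math.sqrt(sum(x * x for x in a))  # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))  # ||B||
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]), 2))  # 0.73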
Example: TF-IDF Similarity for Two Sentences
Python Test Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(sentences):
    # Build TF-IDF vectors (English stop words removed) and compare the
    # first sentence against the rest with cosine similarity
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(sentences)
    feature_names = vectorizer.get_feature_names_out()
    similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()
    return similarities, feature_names, tfidf_matrix

sentences = [
    "I love reading books",
    "I enjoy reading novels",
]

similarities, features, tfidf_matrix = calculate_similarity(sentences)

print(f"Base sentence: {sentences[0]}\n")
for i, similarity in enumerate(similarities, 1):
    print(f"Sentence {i}: {sentences[i]}")
    print(f"Similarity score: {similarity:.4f}\n")

print("Key words from base sentence (with significant TF-IDF weights):")
base_vector = tfidf_matrix[0].toarray()[0]
top_indices = base_vector.argsort()[-5:][::-1]
for idx in top_indices:
    print(f"- {features[idx]}: {base_vector[idx]:.4f}")
Output
Base sentence: I love reading books
Sentence 1: I enjoy reading novels
Similarity score: 0.2020
Key words from base sentence (with significant TF-IDF weights):
- love: 0.6317
- books: 0.6317
- reading: 0.4494
- novels: 0.0000
- enjoy: 0.0000
The similarity score stays low because TF-IDF treats “love” and “enjoy” as unrelated words. It also treats “books” and “novels” as unrelated. Without synonym handling or semantic embeddings, TF-IDF cannot recognize the similarity in meaning.
Limitations of TF-IDF + Cosine Similarity
TF-IDF focuses on word statistics, not meaning. This design produces several weaknesses:
- The approach ignores semantics, grammar, and context.
- The approach does not handle synonyms.
- The approach does not understand word order.
For example, these two sentences contain the same words but express opposite meanings:
- “The cat chases the dog”
- “The dog chases the cat”
TF-IDF reports a near-perfect similarity score because it only compares word overlap and ignores word order.
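A quick check with the same scikit-learn setup as the earlier example confirms this: after stop-word removal both sentences reduce to the words "cat", "chases", and "dog", so their vectors are identical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["The cat chases the dog", "The dog chases the cat"]
matrix = TfidfVectorizer(stop_words="english").fit_transform(pair)
score = cosine_similarity(matrix[0:1], matrix[1:2]).flatten()[0]
print(f"{score:.2f}")  # 1.00, even though the meanings are opposite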
TF-IDF also reacts strongly to rare words. In a sports corpus, a rare player name may dominate the vector. That effect can reduce similarity even when two articles discuss the same event.
When These Methods Still Help
Despite their limitations, Levenshtein distance and TF-IDF + cosine similarity remain popular in large-scale web scraping data cleaning.
Teams choose them because they:
- run fast
- scale well
- provide simple baselines
What’s Next
In the next tutorial, we will cover more advanced text similarity detection methods that handle semantics more reliably.